WISCONSIN BREAST CANCER EDA

In this project we will show the potential of machine learning to enhance cancer detection accuracy while also speeding up diagnosis. Whereas humans rely on a small set of clearly quantifiable criteria to diagnose, machine learning may uncover deeper trends and patterns in test findings. Perhaps most crucially, machine learning has the potential to increase the percentage of patients detected early, which has been shown to triple cancer survival rates.

The project is organized into 3 sections:

Let's dive deep to understand how machine learning determines whether a cell is malignant or benign.

1. Introduction & First Look

1.1 Importing Libraries

In case you don't have the imblearn (SMOTE) or missingno libraries, you can install them by uncommenting the pip install commands below.

1.2 Importing & Describing Data

Because headers are not given in the data set but their order is described in wdbc.names, we will set headers for each column. The column names will be:
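A minimal sketch of how the header list can be built. The ten base measurement names come from wdbc.names; each appears three times (mean, standard error, worst). The exact suffix scheme (`_mean`, `_se`, `_worst`) and the file path are assumptions for illustration.

```python
# Base measurement names from wdbc.names; each appears as mean, standard error, and worst
base = ["radius", "texture", "perimeter", "area", "smoothness",
        "compactness", "concavity", "concave_points", "symmetry", "fractal_dimension"]

# 2 leading columns (id, diagnosis) + 10 features x 3 statistics = 32 columns
columns = ["id", "diagnosis"] + [f"{b}_{s}" for s in ("mean", "se", "worst") for b in base]

# import pandas as pd
# df = pd.read_csv("wdbc.data", header=None, names=columns)  # file path is an assumption
```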

By describing the data and checking the column names, it can be seen that all columns are numerical except diagnosis, which is a categorical feature.

2. Preprocessing & Visualization

2.1 Checking Null Data

The data is visualized with the missingno library to check whether there are any null values. In addition, the null counts are printed numerically with the code below the graph.
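A minimal sketch of the numeric null check, on a tiny toy frame standing in for the WDBC data (the column names here are only illustrative). The missingno call that draws the visual matrix is left as a comment since it needs a display.

```python
import pandas as pd

# Toy stand-in for the WDBC frame
df = pd.DataFrame({"radius_mean": [17.99, 20.57, 19.69],
                   "diagnosis": ["M", "M", "B"]})

# The visual check would be: import missingno as msno; msno.matrix(df)
null_counts = df.isnull().sum()  # numeric count of nulls per column
print(null_counts)
```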

Let's show the counts and percentages of the output feature (diagnosis). Comparing the counts shows that the data set is slightly imbalanced. We can use under-sampling or over-sampling to address this problem. Because under-sampling causes a lot of information loss, over-sampling will be used.
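A sketch of the count/percentage check, using the WDBC class sizes (357 benign, 212 malignant) as a stand-in series rather than the loaded frame:

```python
import pandas as pd

# Stand-in for df["diagnosis"]; 357 B / 212 M are the WDBC class sizes
diagnosis = pd.Series(["B"] * 357 + ["M"] * 212, name="diagnosis")

counts = diagnosis.value_counts()                       # absolute counts per class
percents = diagnosis.value_counts(normalize=True) * 100  # percentages per class
print(counts, percents.round(2), sep="\n")
```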

2.2 Checking Duplicates

Here we check whether there are any duplicate records, both by comparing whole rows and by looking for repeated id's.
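A minimal sketch of both duplicate checks on a toy frame (the column names are illustrative):

```python
import pandas as pd

# Toy frame with one fully duplicated row
df = pd.DataFrame({"id": [1, 2, 2], "radius_mean": [17.9, 20.5, 20.5]})

dup_rows = df.duplicated().sum()       # rows identical in every column
dup_ids = df["id"].duplicated().sum()  # repeated patient ids only
print(dup_rows, dup_ids)
```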

2.3 Scaling & Suppressing

To reduce computational cost, the features will be scaled with the standard scaler. This preserves the relative distances between data points while decreasing the time needed to train a machine learning model. Additionally, the id column will be dropped since it is not required to estimate the output category. Also, so that newly observed data can be processed later, the fitted scalers are stored in a dictionary, and that dictionary is saved into a general dictionary called parameters_dict.
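A minimal sketch of per-column scaling with the fitted scalers kept in `parameters_dict`, on toy data (the exact structure of the author's dictionary is an assumption):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the feature frame
df = pd.DataFrame({"radius_mean": [17.99, 20.57, 19.69, 11.42],
                   "texture_mean": [10.38, 17.77, 21.25, 20.38]})

parameters_dict = {"scalers": {}}
for col in df.columns:
    scaler = StandardScaler()
    df[col] = scaler.fit_transform(df[[col]]).ravel()  # zero mean, unit variance
    parameters_dict["scalers"][col] = scaler  # kept so new observations can be scaled identically
```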

It seems every feature has outliers, so the outliers will be suppressed using interquartile ranges.
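A minimal sketch of IQR-based suppression on a single toy series: values beyond the 1.5 × IQR fences are clipped to the fence rather than dropped.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 100])  # 100 is an obvious outlier

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

suppressed = s.clip(lower, upper)  # cap values at the IQR fences
print(suppressed.tolist())
```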

To double-check whether the outliers were suppressed, the features are visualized with boxplots again.

2.4 Pairplots

Now, to visualize the data from different perspectives, each main category (mean, standard error, worst) will be shown against the diagnosis feature. Purple represents malignant cells and orange benign cells.

2.4.1 Features Encoded as Mean

2.4.2 Features Encoded as Standard Error

2.4.3 Features Encoded as Worst

Almost all features separate the two classes at a significant rate. This suggests that machine learning models will generally be successful at predicting cancer cells. It can be seen from the figures that the benign distribution generally concentrates on smaller values than the malignant one.

2.5 Correlations & Heatmap

Correlations between features are visualized with a heatmap. Some correlations between features are 1.0 or very close to 1.0, meaning those features can represent each other, so we can remove one feature from each highly correlated pair to reduce the curse of dimensionality.
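A sketch of one common way to find such pairs, on synthetic data where one feature is almost a multiple of another (the 0.95 threshold is an illustrative choice, not necessarily the author's):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=100)
df = pd.DataFrame({"radius_mean": a,
                   "perimeter_mean": a * 6.28 + rng.normal(scale=0.01, size=100),  # ~1.0 corr
                   "texture_mean": rng.normal(size=100)})

corr = df.corr().abs()
# Keep only the upper triangle so each pair is counted once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]
print(to_drop)
```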

To see how each feature relates to the diagnosis feature, which is our output, the correlations are visualized with a barplot.
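A minimal sketch of computing and ranking those target correlations on synthetic data, with the diagnosis already encoded as 0/1 (the plot call is left as a comment):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
y = rng.integers(0, 2, 200)  # 0 = benign, 1 = malignant (encoding assumed)
df = pd.DataFrame({"radius_mean": y * 2 + rng.normal(size=200),  # informative feature
                   "texture_mean": rng.normal(size=200),          # pure noise
                   "diagnosis": y})

target_corr = (df.corr()["diagnosis"]
                 .drop("diagnosis")
                 .sort_values(key=abs, ascending=False))
# target_corr.plot.bar() would draw the barplot
print(target_corr)
```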

2.6 Oversampling

Even though the data is not severely imbalanced, a model will tend to predict more benign data if we don't equalize the class counts. The reason is that the machine learning model will update its weights with respect to benign cells more often than malignant ones.

2.7 Saving Outputs

To avoid processing the data again and again, it will be saved. Moreover, the parameters will be saved for use on future data.
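One way to persist the parameter dictionary is pickling; this is a sketch (the file name and dictionary contents are illustrative, and the processed frame itself could be saved with `df.to_csv(...)`):

```python
import os
import pickle
import tempfile

parameters_dict = {"scalers": {}, "columns": ["radius_mean", "texture_mean"]}

path = os.path.join(tempfile.gettempdir(), "parameters_dict.pkl")
with open(path, "wb") as f:
    pickle.dump(parameters_dict, f)   # save for later runs

with open(path, "rb") as f:
    restored = pickle.load(f)         # reload without reprocessing
```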

3. Train & Evaluation

In this section 5 machine learning models will be trained in total:

After training, each model's scores on different metrics will be shown to visualize its performance. Also, the indices the model failed to predict will be printed, to understand why the model fails on those data points. Then the scores will be compared, the best model will be selected, and we will comment on why it performed better than the others.

3.1 SVM - Support Vector Machine

SVM takes all the data and generates a hyperplane that divides it; this hyperplane is known as the decision boundary. For a simple SVM, each side of the hyperplane represents a category.
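A minimal sketch of fitting an SVM on two well-separated synthetic clusters (the linear kernel is an illustrative choice, not necessarily the model settings used in the project):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two well-separated 2D clusters standing in for benign / malignant
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)
model = SVC(kernel="linear").fit(X_tr, y_tr)  # hyperplane = decision boundary
acc = model.score(X_te, y_te)
print(acc)
```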

3.2 Logistic Regression

Logistic Regression is pretty similar to Linear Regression, but it is used to solve classification problems by passing the linear output through a sigmoid function.
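The sigmoid is the piece that turns the linear score into a class probability; a minimal sketch:

```python
import numpy as np

def sigmoid(z):
    # Squashes the linear score w.x + b into a probability in (0, 1);
    # 0.5 is the usual decision threshold between the two classes.
    return 1.0 / (1.0 + np.exp(-z))
```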

3.3 Random Forest

Random Forest builds an ensemble of decision trees using the bagging method, in which each tree is trained on a different bootstrap subset of the training data.
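A minimal sketch on the same kind of synthetic clusters; `bootstrap=True` (the scikit-learn default) is what implements the bagging step:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

# bootstrap=True (the default): each of the 50 trees sees a resampled subset
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(forest.score(X, y))
```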

3.4 Decision Tree

There are three main characteristics of decision tree:

  1. Internal nodes represent the features of the dataset
  2. Branches represent the decision rules of the tree
  3. Leaf nodes represent the outcomes of the tree
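The three parts can be seen directly by printing a tiny fitted tree; `export_text` shows the internal node (the split on a feature), the two branches (the rules), and the leaves (the class outcomes). The feature name is illustrative.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[0], [1], [10], [11]]   # one toy feature
y = [0, 0, 1, 1]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Prints the split (internal node), the <=/> rules (branches), and the classes (leaves)
print(export_text(tree, feature_names=["radius_mean"]))
```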

3.5 KNeighborsClassifier

KNN classifies each data point by checking the data points nearest to it. In other words, it compares by similarity and determines the output from the labels of its neighbors.
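A minimal sketch: with k = 3, the query point's three nearest neighbours vote on the label.

```python
from sklearn.neighbors import KNeighborsClassifier

X = [[0], [1], [10], [11]]   # one toy feature
y = [0, 0, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
# Neighbours of 0.5 are 0, 1, 10 -> labels 0, 0, 1 -> majority vote is 0
pred = knn.predict([[0.5]])
print(pred)
```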

3.6 Visualizing Wrong Predicted Data

Instead of visualizing the wrong predictions of all models with plots, I will show only the logistic regression failures. This can be switched to another machine learning algorithm just by changing the model that produces y_prediction. At first I was going to plot pairplots, but because there are 30 features that would either take a lot of space or be too small to read. So, to visualize the failures, I collect the wrongly predicted samples and subtract the overall feature means from the means of those samples.
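A minimal sketch of that mean-difference idea on toy data; `y_pred` here is a hypothetical model output, not the project's actual predictions:

```python
import numpy as np
import pandas as pd

# Toy test set and labels; the last sample is a hypothetical wrong prediction
X_test = pd.DataFrame({"radius_mean": [1.0, 2.0, 10.0, 1.5],
                       "texture_mean": [0.5, 0.7, 5.0, 0.6]})
y_true = np.array([0, 0, 1, 0])
y_pred = np.array([0, 0, 1, 1])

wrong = X_test[y_true != y_pred]      # samples the model got wrong
diff = wrong.mean() - X_test.mean()   # gap between wrong-sample means and overall means
print(diff)
```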

Analysing these 2 plots shows that the means of the wrongly predicted samples differ greatly from the overall feature means, which helps us understand why the machine learning model predicted these data points incorrectly.

3.7 Feature Importance of Logistic Regression

For the same reason as with the wrong predictions, only the feature importances of logistic regression are shown. The ranking of the features is similar to the ranking of the correlations shown in the visualization section. I obtained the coefficients from the trained logistic regression, matched them with the feature names, and sorted them by the absolute value of each coefficient.
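A minimal sketch of that coefficient ranking on synthetic data, where the first feature is constructed to drive the label (names are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# Only the first feature actually drives the label
y = (X[:, 0] * 3 + rng.normal(scale=0.5, size=200) > 0).astype(int)

names = ["radius_mean", "texture_mean", "smoothness_mean"]
model = LogisticRegression().fit(X, y)

# Match coefficients with names, then sort by absolute magnitude
ranking = sorted(zip(names, model.coef_[0]), key=lambda t: abs(t[1]), reverse=True)
print(ranking)
```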

3.8 Conclusion

Since the malignant and benign data characteristics are well separated, as I explained in the data visualization section, all models performed well and their scores are very alike. However, comparing all the trained machine learning models shows that logistic regression performed best. There may be a few reasons why the other models did not perform better than logistic regression:

  1. Logistic regression performs well on linearly separable data sets.
  2. The total number of over-sampled data points is 714, which is actually low compared to real-life data.
  3. The machine learning models were trained with default parameters instead of tuned ones.

With the aim of reaching better prediction results, voting classifiers, gradient-boosting-based algorithms, parameter tuning, or collecting new features or data can be tried.